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(57) Abstract: 

PURPOSE: To classify and predict, with high accuracy, data
output frequently from an information source, by selecting
the decisions that maximize the information gain and
classifying the data preferentially with the decisions of
highest identification capacity.

CONSTITUTION: Decisions with high information gain for the
measured data are added sequentially to a decision list,
where each decision is a logical product and the number of
characters comprising one logical product is fixed. A
stopping rule based on the misclassification rate is
provided; the addition of decisions stops at the point where
the rule is satisfied, and the decision list obtained at
that time is output. In this way, unknown data generated
from the same information source as the measured data can be
classified and predicted with high accuracy.
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Under Condition 14, according to [11] and [13], we define 
the generalized inverse function of $L_0$ by $L_0^{-1}(L_0(Z)) = Z$
for $0 \le Z \le 1$ and $L_0^{-1}(Z) \ge 1$ for $Z > L_0(1)$.
Similarly, we define the generalized inverse function of $L_1$
by $L_1^{-1}(L_1(Z)) = Z$ for $0 \le Z \le 1$ and $L_1^{-1}(Z) \le 0$ for
$Z > L_1(0)$.

Example 5: For the entropic loss function

$$L(Y, Z) = Y \ln(Y/Z) + (1 - Y) \ln((1 - Y)/(1 - Z))$$

we see that $L_0(Z) = -\ln(1 - Z)$ and $L_1(Z) = -\ln Z$. Then
we have $L_0^{-1}(Z) = 1 - e^{-Z}$ and $L_1^{-1}(Z) = e^{-Z}$. A simple
calculation yields $\lambda^* = 1$ (see [11]). We see that Conditions
14-16 are satisfied for $L$.

Example 6: For the quadratic loss function $L(Y, Z) = (Y - Z)^2$,
we see that $L_0(Z) = Z^2$ and $L_1(Z) = (1 - Z)^2$.
Then we have $L_0^{-1}(Z) = \sqrt{Z}$ and $L_1^{-1}(Z) = 1 - \sqrt{Z}$.
A simple calculation yields $\lambda^* = 2$ (see [11]). We see that
Conditions 14-16 are satisfied for $L$.

Example 7: For the Hellinger loss function

$$L(Y, Z) = \tfrac{1}{2}\left((\sqrt{Y} - \sqrt{Z})^2 + (\sqrt{1 - Y} - \sqrt{1 - Z})^2\right)$$

we see that $L_0(Z) = 1 - \sqrt{1 - Z}$ and $L_1(Z) = 1 - \sqrt{Z}$.
Then we can set

$$L_0^{-1}(Z) = 2Z - Z^2 \ (0 \le Z \le 1), \qquad L_0^{-1}(Z) = 1 \ (Z \ge 1)$$

and

$$L_1^{-1}(Z) = (1 - Z)^2 \ (0 \le Z \le 1), \qquad L_1^{-1}(Z) = 0 \ (Z \ge 1).$$

A simple calculation shows $\lambda^* = \sqrt{2}$ (see [11]). We see that
Conditions 14 and 15 are satisfied, but Condition 16 is not
satisfied.

Example 8: For the logistic loss function as in (3), we see
that

$$L_0(Z) = \tfrac{1}{2} \ln\left((e^{2(2Z-1)} + 1)/(1 + e^{-2})\right)$$

and

$$L_1(Z) = \tfrac{1}{2} \ln\left((1 + e^{-2(2Z-1)})/(1 + e^{-2})\right).$$

Then we have

$$L_0^{-1}(Z) = \tfrac{1}{2} + \tfrac{1}{4} \ln\left((1 + e^{-2})e^{2Z} - 1\right)$$

and

$$L_1^{-1}(Z) = \tfrac{1}{2} - \tfrac{1}{4} \ln\left((1 + e^{-2})e^{2Z} - 1\right).$$

A simple calculation yields $\lambda^* = 2$. We see that Conditions
14-16 are satisfied for $L$.
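
The generalized inverses in Examples 5-8 can be checked numerically. The following Python sketch is an illustration only and is not part of the original analysis; the names `LOSSES` and `roundtrip_error` are hypothetical. It encodes $L_0$, $L_1$ and the inverses stated above, and measures how closely $L_y^{-1}(L_y(Z)) = Z$ holds on a grid inside $(0, 1)$.

```python
import math

K = 1.0 + math.exp(-2.0)  # normalizing constant of the logistic loss

# name -> (L0, L1, L0_inv, L1_inv), with L0(Z) = L(0, Z) and L1(Z) = L(1, Z)
LOSSES = {
    "entropic": (
        lambda z: -math.log(1.0 - z),
        lambda z: -math.log(z),
        lambda u: 1.0 - math.exp(-u),
        lambda u: math.exp(-u),
    ),
    "quadratic": (
        lambda z: z ** 2,
        lambda z: (1.0 - z) ** 2,
        lambda u: math.sqrt(u),
        lambda u: 1.0 - math.sqrt(u),
    ),
    "hellinger": (
        lambda z: 1.0 - math.sqrt(1.0 - z),
        lambda z: 1.0 - math.sqrt(z),
        lambda u: 2.0 * u - u ** 2 if u <= 1.0 else 1.0,
        lambda u: (1.0 - u) ** 2 if u <= 1.0 else 0.0,
    ),
    "logistic": (
        lambda z: 0.5 * math.log((math.exp(2.0 * (2.0 * z - 1.0)) + 1.0) / K),
        lambda z: 0.5 * math.log((1.0 + math.exp(-2.0 * (2.0 * z - 1.0))) / K),
        lambda u: 0.5 + 0.25 * math.log(K * math.exp(2.0 * u) - 1.0),
        lambda u: 0.5 - 0.25 * math.log(K * math.exp(2.0 * u) - 1.0),
    ),
}


def roundtrip_error(name, grid):
    """Largest |L_y^{-1}(L_y(z)) - z| over the grid, for y = 0 and y = 1."""
    l0, l1, l0_inv, l1_inv = LOSSES[name]
    return max(
        max(abs(l0_inv(l0(z)) - z), abs(l1_inv(l1(z)) - z)) for z in grid
    )
```

The Hellinger entries clip the inverses to 1 and 0 outside $[0, 1]$, matching the two-branch definition in Example 7.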

Under the above notation, a variant of Kivinen and Warmuth's
version of the aggregating strategy for the case of
$\mathcal{Y} = [0, 1]$, which we denote by AGG, is described as follows:

Algorithm AGG: Let $\mathcal{H}_k$, $L$, $\pi$, and $\lambda > 0$ be given.
At each time $t$, on receiving $X_t$, compute

$$\Delta_t(0) \stackrel{\mathrm{def}}{=} L(0 \mid D^{t-1}, X_t)$$

and

$$\Delta_t(1) \stackrel{\mathrm{def}}{=} L(1 \mid D^{t-1}, X_t)$$

where the notation $L(Y \mid D^{t-1}, X_t)$ ($Y = 0, 1$) follows (21).
Then predict $Y_t$ with any value $\hat{Y}_t$ satisfying

$$L_1^{-1}(\Delta_t(1)) \le \hat{Y}_t \le L_0^{-1}(\Delta_t(0)). \tag{26}$$

If no such $\hat{Y}_t$ exists, the algorithm fails. We write
$\hat{Y}_t = \mathrm{AGG}_t(X_t)$.

After prediction, the correct outcome $Y_t$ is received.
Note that if $L_1^{-1}(\Delta_t(1)) \le L_0^{-1}(\Delta_t(0))$ holds, then $\hat{Y}_t$
satisfying (26) is given by, for example,

$$\hat{Y}_t = \tfrac{1}{2}\left(L_1^{-1}(\Delta_t(1)) + L_0^{-1}(\Delta_t(0))\right).$$

Note also that (21) can be written as

$$L(Y \mid D^{t-1}, X_t) = -\frac{1}{\lambda} \ln \int v_t(\theta) \exp\left(-\lambda L((X_t, Y) : f_\theta)\right) d\theta$$

where $v_t(\theta) = w_t(\theta) / \int w_t(\theta)\, d\theta$ and $w_t(\theta)$ follows the
following updating rule:

$$w_t(\theta) = w_{t-1}(\theta) \exp\left(-\lambda L(D_{t-1} : f_\theta)\right)$$

which implies that $w_t(\theta)$ multiplicatively decays at each time
and that $\Delta_t(0)$ and $\Delta_t(1)$ in AGG can be computed
incrementally with respect to $t$.
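
The prediction and update steps can be sketched in Python. This is a minimal illustration under assumptions that are not in the original: the continuous parameter space is replaced by a finite grid of constant hypotheses $f_\theta(X) = \theta$ with a uniform prior, the loss is the quadratic loss of Example 6 (so $L_0^{-1}(Z) = \sqrt{Z}$ and $L_1^{-1}(Z) = 1 - \sqrt{Z}$), and the function name `agg_quadratic` is hypothetical.

```python
import math


def agg_quadratic(ys, grid, lam=2.0):
    """Run AGG on outcomes ys in [0, 1]; returns the list of predictions."""
    weights = [1.0 / len(grid)] * len(grid)  # uniform prior (assumption)
    preds = []
    for y in ys:
        total = sum(weights)
        v = [w / total for w in weights]  # normalized weights v_t(theta)
        # Delta_t(y0) = -(1/lam) ln sum_theta v_t(theta) exp(-lam (y0 - theta)^2)
        delta = []
        for y0 in (0.0, 1.0):
            mix = sum(vi * math.exp(-lam * (y0 - th) ** 2)
                      for vi, th in zip(v, grid))
            delta.append(max(0.0, -math.log(mix) / lam))  # guard rounding
        lo = 1.0 - math.sqrt(delta[1])  # L_1^{-1}(Delta_t(1))
        hi = math.sqrt(delta[0])        # L_0^{-1}(Delta_t(0))
        if lo > hi + 1e-9:
            raise RuntimeError("AGG fails: interval (26) is empty")
        preds.append(0.5 * (lo + hi))   # midpoint choice of the prediction
        # Multiplicative update once the true outcome y is revealed.
        weights = [w * math.exp(-lam * (y - th) ** 2)
                   for w, th in zip(weights, grid)]
    return preds
```

For a finite class with a uniform prior, summing the per-step guarantee gives a cumulative loss within $(1/\lambda) \ln N$ of the best grid point, where $N$ is the grid size; this is the finite-class analogue of the bounds discussed for continuous classes.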

We say that $L$ is $\lambda$-realizable when AGG never fails for $\lambda$.
Haussler et al. [11] proved that in the case of $\mathcal{Y} = \{0, 1\}$,
under Conditions 14 and 15, $L$ is $\lambda$-realizable if and only if
$\lambda \le \lambda^*$ for $\lambda^*$ as in (25).

Kivinen and Warmuth [13] proved that if $L$ satisfies Condition 16,
then for $\hat{Y}_t = \mathrm{AGG}_t(X_t)$ satisfying (26), for all
$Y_t \in [0, 1]$, the following inequality holds:

$$L((X_t, Y_t) : \mathrm{AGG}_t) = L(D_t : \mathrm{AGG}_t) \le L(Y_t \mid D^{t-1}, X_t). \tag{27}$$

This implies that $\hat{Y}_t$ satisfying (26) is well-defined in the sense
that the loss for AGG with respect to any real outcome $Y_t$ in
$[0, 1]$ is upper-bounded by $L(Y_t \mid D^{t-1}, X_t)$.

C. Cumulative Loss Bounds for the Aggregating Strategy

Below we give upper bounds on the cumulative loss for
AGG using $\mathcal{H}_k$. First note that summing both sides of (27)
with respect to $t$ gives

$$\sum_{t=1}^{m} L(D_t : \mathrm{AGG}_t) \le \sum_{t=1}^{m} L(Y_t \mid D^{t-1}, X_t) = I(D^m : \mathcal{H}_k).$$

Here we have used Lemma 1 to derive the last equation.
Combining this inequality with (4) and (5) immediately gives
the following theorem.

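Theorem 2. Upper Bounds on Cumulative Loss for AGG:
Let $\mathcal{Y} = [0, 1]$ and suppose that the loss function $L$ satisfies
Conditions 14-16. Let $\lambda^*$ be the largest number such that $L$
is $\lambda^*$-realizable, i.e., $\lambda^*$ is as in (25). Let $P$ be the target
distribution. Under Conditions 1-7 for $P$, $\mathcal{H}_k$, $L$, and $\pi$, the
expected cumulative loss for AGG with $\lambda = \lambda^*$ using $\mathcal{H}_k$ is
upper-bounded as follows:

$$\limsup_{m \to \infty} \left( E\left[\sum_{t=1}^{m} L(D_t : \mathrm{AGG}_t)\right] - \left( m E[L(D : f_{\theta_0})] + \frac{k}{2\lambda^*} \ln \frac{m\lambda^*}{2\pi} + \frac{1}{\lambda^*} \ln \frac{\sqrt{|J(\theta_0)|}}{\pi(\theta_0)} \right) \right) \le 0 \tag{28}$$

where the notation of $\theta_0$ and $|J(\theta_0)|$ follows Conditions 3
and 5, respectively.

Under Conditions 1 and 8-12 for $\mathcal{H}_k$, $L$, and $\pi$, for all
$D^m \in \mathcal{D}^m(\Theta_k)$, the cumulative loss for AGG with $\lambda = \lambda^*$
using $\mathcal{H}_k$ with respect to $D^m$ is upper-bounded as follows:

$$\sum_{t=1}^{m} L(D_t : \mathrm{AGG}_t) \le \sum_{t=1}^{m} L(D_t : f_{\hat{\theta}}) + \frac{k}{2\lambda^*} \ln \frac{m\lambda^*\mu}{2\pi} + \frac{1}{\lambda^*} \ln \frac{1}{rc} + o(1) \tag{29}$$

where $o(1)$ goes to zero uniformly in $D^m \in \mathcal{D}^m(\Theta_k)$ as
$m$ goes to infinity. The notation of $\mathcal{D}^m(\Theta_k)$, $\hat{\theta}$, $\mu$, $r$, and $c$
follows Conditions 9-12.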

For the case of $\mathcal{Y} = \{0, 1\}$, (28) and (29) hold under
Conditions 14 and 15 only (i.e., without Condition 16).

Define the minimum expected cumulative loss over $\mathcal{H}_k$ by

$$E\left[\sum_{t=1}^{m} L(D_t : f_{\theta_0})\right] = m E[L(D : f_{\theta_0})]$$

which cannot be attained without knowing $P$. Equation (28)
implies that the expected cumulative loss for AGG is within

$$\frac{k}{2\lambda^*} \ln \frac{m\lambda^*}{2\pi} + \frac{1}{\lambda^*} \ln \frac{\sqrt{|J(\theta_0)|}}{\pi(\theta_0)} + o(1)$$

of the minimum expected cumulative loss over $\mathcal{H}_k$. Further
define the minimum empirical cumulative loss over $\mathcal{H}_k$ by

$$\min_{\theta \in \Theta_k} \sum_{t=1}^{m} L(D_t : f_\theta)$$

which is attainable using a single $\theta \in \Theta_k$ but cannot be
attained by any on-line prediction algorithm for most sequences.
Equation (29) implies that the worst case cumulative loss for
AGG is within

$$\frac{k}{2\lambda^*} \ln \frac{m\lambda^*\mu}{2\pi} + \frac{1}{\lambda^*} \ln \frac{1}{rc} + o(1)$$

of the minimum empirical cumulative loss over $\mathcal{H}_k$, where the
worst case is taken with respect to $D^m$ over $\mathcal{D}^m(\Theta_k)$.

Corresponding to (18) and (19), we are able to derive
tighter upper bounds on the expected cumulative loss and the
probabilistic cumulative loss for AGG. We omit the argument
here.

Finally, note that in the case where the hypothesis class is a
union set $\mathcal{H} = \bigcup_{k=1}^{s} \mathcal{H}_k$ ($s < \infty$) with respect to $k$ instead of
$\mathcal{H}_k$ for a single $k$, under the conditions as in Theorem 2, for
any $D^m \in \bigcap_{k=1}^{s} \mathcal{D}^m(\Theta_k)$, we can upper-bound the worst case
cumulative loss for AGG using $\mathcal{H}$ by the following quantity,
which is an upper bound on the ESC of the form of (2):

$$\min_{k} \left\{ \sum_{t=1}^{m} L(D_t : f_{\hat{\theta}}) + \frac{k}{2\lambda^*} \ln \frac{m\lambda^*\mu}{2\pi} + \frac{1}{\lambda^*} \ln \frac{1}{rc} - \frac{1}{\lambda^*} \ln \pi(k) \right\} + o(1)$$

where $\lambda^*$ is as in (25).



Example 9: For $\mathcal{H}_k$, $L$, and $\pi$ as in Example 1, if the
target distribution $P$ satisfies Conditions 2-6, the expected
cumulative loss for AGG for sample size $m$ is upper-bounded
by



m 



m£[(y-#X) 2 ]+-ln- + -ln 



7T 



k/2 



4 7T 2 2 fc ]?(l + k/2) 



o(l) 



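where $J_0$ is the matrix for which the $(i, j)$th component is
$E_P[X^{(i)} X^{(j)}]$ and $\theta_0 = (J_0)^{-1}(E_P[YX])$.
For given $D^m$, let

$$\hat{\theta} = (\hat{J})^{-1}\left(\frac{1}{m} \sum_{t=1}^{m} Y_t X_t\right)$$

where $\hat{J}$ is a matrix for which the $(i, j)$th component is given
by $\frac{1}{m} \sum_{t=1}^{m} X_t^{(i)} X_t^{(j)}$. Let $\mathcal{D}^m(\Theta_k) = \{D^m \in \mathcal{D}^m : \hat{J}$ is
regular and $\hat{\theta} \in \Theta_k\}$. Then setting $\lambda^* = 2$, $\mu = 2k$, $r = 1/2^k$,
and $c = 2^k \Gamma(1 + k/2)/\pi^{k/2}$, for all $D^m \in \mathcal{D}^m(\Theta_k)$, the
cumulative loss for AGG with respect to $D^m$ is upper-bounded
by

$$\sum_{t=1}^{m} (Y_t - \hat{\theta}^T X_t)^2 + \frac{k}{4} \ln \frac{2mk}{\pi} + \frac{1}{2} \ln \frac{\pi^{k/2}}{\Gamma(1 + k/2)} + o(1).$$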

Example 10: For Ti, £, and 7r as in Example 2, if the 
target distribution P satisfies Conditions 2-6, the expected 
cumulative loss for AGG for sample size m is upper-bounded 
by 

mE[L(D : fa)] + \ ln ~ + j ln J 0 + o(l), 



where 



and 



( e 20 o -l + e -2^ 0 +l)2 



_ l i l + gppy-i] 

fo- 2 + 4 ln i_£ p [ 2 y -I]" 



For given D m , let 



m 

TT7. f * 



t=l 



m si 

Let $\mathcal{D}^m(\Theta) = \{D^m \in \mathcal{D}^m : \hat{\theta} \in \Theta\}$. Then setting
$k = 1$, $\lambda^* = 2$, $\mu = 2$, $r = 1/2$, and $c = 1$, for all
$D^m \in \mathcal{D}^m(\Theta)$, the cumulative loss for AGG with respect
to $D^m$ is upper-bounded by

$$\sum_{t=1}^{m} L(D_t : f_{\hat{\theta}}) + \frac{1}{4} \ln \frac{2m}{\pi} + \frac{1}{2} \ln 2 + o(1).$$



Example 11: For H, and 7r as in Example 3, for given 
D m , let 

2 



(.5 



Let D m (G) = {D m eV m :0e 6}. Then setting k = 1, 
A* = ti = (l/2)F(e), r = 1/2, and c = 1/(1 - 2e), for 
all Z> m € Z> m (8), the cumulative loss for AGG with respect 
to D m is upper-bounded by 

y/2mF{e) 



Air 



V2 



In 2(l-2e)+o(l). 



Example 12: Consider $\mathcal{H}$ as in Example 4 as a class of
real-valued functions through $f_\theta(X) = f_\theta(1 \mid X)$. Then for $\mathcal{H}$, $L$,
and $\pi$ as in Example 4, setting $\lambda^* = 2$, $\mu = 2$, $r = 1/2^k$, and
$c = 1$, for all $D^m \in \mathcal{D}^m$, the cumulative loss for AGG with
respect to $D^m$ is upper-bounded by

$$\sum_{i=1}^{k} \frac{m_{1i}(m_i - m_{1i})}{m_i} + \frac{k}{4} \ln \frac{2m}{\pi} + \frac{k}{2} \ln 2 + o(1)$$

where $m_i$ is the number of examples for which $X$ fell into
$S_i$ ($i = 1, \ldots, k$) and $m_{1i}$ is the number of examples for
which $Y = 1$ and $X$ fell into $S_i$ ($i = 1, \ldots, k$).

IV. Applications of ESC to Batch-Learning 
A. Batch-Learning Model 

In this section we consider an application of ESC to the
design and analysis of a batch-learning algorithm. In general,
a batch-learning algorithm (see, e.g., [10]) is an algorithm that
takes as input a sequence of examples $D^m = D_1 \cdots D_m \in \mathcal{D}^*$
($D_t = (X_t, Y_t)$, $t = 1, \ldots, m$) and a hypothesis class $\mathcal{H}$,
and then outputs a single hypothesis belonging to $\mathcal{H}$.

Let $\mathcal{F}$ be a set of all functions from $\mathcal{X}$ to $\mathcal{Z}$ in the case where
$\mathcal{Z} \subset \mathbf{R}$, or let $\mathcal{F}$ be a set of all conditional probability densities
(or probability mass functions) over $\mathcal{Y}$ for given $X \in \mathcal{X}$ in
the case where $\mathcal{Z}$ is a set of conditional probability densities
(or probability mass functions) over $\mathcal{Y}$ for given $X \in \mathcal{X}$.
Suppose that each $D$ is independently drawn according to the
target distribution $P$ over $\mathcal{D}$. For a hypothesis $f \in \mathcal{H}$, we
define a generalization loss of $f$ with respect to $P$ by

$$\Delta_P(f) \stackrel{\mathrm{def}}{=} E_P[L(D : f)] - \inf_{h \in \mathcal{F}} E_P[L(D : h)]$$

where $E_P$ denotes the expectation taken for the generation of
$D = (X, Y)$ with respect to $P$. For sample size $m$, we also
define a statistical risk for a batch-learning algorithm $\mathcal{A}$ as the
expected value of the generalization loss:

$$E[\Delta_P(\hat{f})]$$

where $\hat{f}$ is an output of $\mathcal{A}$, which is a random variable
depending on the input sequence $D^m$, and the expectation
$E$ is taken for the generation of $D^m$ with respect to $P(D^m)$.
Our goal is to design a batch-learning algorithm for which the
statistical risk is as small as possible.

B. The Minimum L-Complexity Algorithm 

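Next we consider a batch-approximation of ESC by a single
hypothesis to motivate a batch-learning algorithm. We now
approximate the integral in (1) by quantizing $\Theta_k$. For a
$k$-dimensional parametric hypothesis class $\mathcal{H}_k = \{f_\theta : \theta \in \Theta_k \subset \mathbf{R}^k\}$,
let $\Theta_k^{(m)}$ be a finite subset of $\Theta_k$ depending on sample
size $m$. We define $\mathcal{H}_k^{(m)}$ ($\subset \mathcal{H}_k$) by $\mathcal{H}_k^{(m)} \stackrel{\mathrm{def}}{=} \{f_\theta : \theta \in \Theta_k^{(m)}\}$.
We refer to $\Theta_k^{(m)}$ as a quantization of $\Theta_k$. Similarly, we refer
to $\mathcal{H}_k^{(m)}$ as a quantization of $\mathcal{H}_k$. We call a map $\tau_m : \Theta_k \to \Theta_k^{(m)}$
a truncation. Similarly, we also call a map $\mathcal{H}_k \to \mathcal{H}_k^{(m)}$
defined by $f_\theta \in \mathcal{H}_k \mapsto f_{\tau_m(\theta)} \in \mathcal{H}_k^{(m)}$ a truncation of $\mathcal{H}_k$. For
$\theta \in \Theta_k^{(m)}$, let $S(\theta) \stackrel{\mathrm{def}}{=} \{\theta' \in \Theta_k : \tau_m(\theta') = \theta\}$, which is a set
of real-valued points which are truncated to $\theta$ by $\tau_m$.

Choosing $\tau_m$ so that for each $\theta \in \Theta_k^{(m)}$, the Lebesgue
measure of $S(\theta)$ goes to zero as $m$ increases to infinity,
we may consider an approximation of ESC by the quantity
$J(D^m : \mathcal{H}_k)$ defined as follows:

$$J(D^m : \mathcal{H}_k) \stackrel{\mathrm{def}}{=} -\frac{1}{\lambda} \ln \sum_{\theta \in \Theta_k^{(m)}} W(\theta) \exp\left(-\lambda \sum_{t=1}^{m} L(D_t : f_\theta)\right)$$

where $W(\theta) \stackrel{\mathrm{def}}{=} \int_{\theta' \in S(\theta)} \pi(\theta')\, d\theta'$ for a given prior density
$\pi(\theta)$.

Since

$$\sum_{\theta \in \Theta_k^{(m)}} W(\theta) \exp\left(-\lambda \sum_{t=1}^{m} L(D_t : f_\theta)\right) \ge \max_{\theta \in \Theta_k^{(m)}} W(\theta) \exp\left(-\lambda \sum_{t=1}^{m} L(D_t : f_\theta)\right)$$

the quantity $J(D^m : \mathcal{H}_k)$ can be upper-bounded as follows:

$$J(D^m : \mathcal{H}_k) \le -\frac{1}{\lambda} \ln \max_{\theta \in \Theta_k^{(m)}} W(\theta) \exp\left(-\lambda \sum_{t=1}^{m} L(D_t : f_\theta)\right) = \min_{\theta \in \Theta_k^{(m)}} \left\{ \sum_{t=1}^{m} L(D_t : f_\theta) - \frac{1}{\lambda} \ln W(\theta) \right\}.$$

Notice here that $W(\theta)$ can be thought of as a probability mass
of $\theta$ over $\Theta_k^{(m)}$ since

$$\sum_{\theta \in \Theta_k^{(m)}} W(\theta) = 1.$$

Hence $-\ln W(\theta)$ can be interpreted as the codelength for $\theta$.
This implies that letting $L_m$ be any function $\Theta_k^{(m)} \to \mathbf{R}^+$
satisfying

$$\sum_{\theta \in \Theta_k^{(m)}} e^{-L_m(\theta)} \le 1$$

we can upper-bound $J(D^m : \mathcal{H}_k)$ by

$$\min_{\theta \in \Theta_k^{(m)}} \left\{ \sum_{t=1}^{m} L(D_t : f_\theta) + \frac{1}{\lambda} L_m(\theta) \right\}. \tag{30}$$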
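
The inequality between $J(D^m : \mathcal{H}_k)$ and the two-part quantity can be checked on a toy case. The following Python sketch is an illustration under assumptions not found in the original: constant hypotheses $f_\theta = \theta$ on a grid, the quadratic loss, and uniform masses $W(\theta)$. The function names are hypothetical.

```python
import math


def j_quantized(cum_losses, masses, lam):
    """J = -(1/lam) ln sum_theta W(theta) exp(-lam * cumulative loss)."""
    total = sum(w * math.exp(-lam * c) for w, c in zip(masses, cum_losses))
    return -math.log(total) / lam


def two_part_bound(cum_losses, masses, lam):
    """min_theta { cumulative loss(theta) - (1/lam) ln W(theta) }."""
    return min(c - math.log(w) / lam for w, c in zip(masses, cum_losses))


# Toy data and a uniform quantization of [0, 1] (both are assumptions).
ys = [0.2, 0.9, 0.4, 0.7, 0.1, 0.5]
thetas = [i / 10 for i in range(11)]
masses = [1.0 / len(thetas)] * len(thetas)
cum_losses = [sum((y - th) ** 2 for y in ys) for th in thetas]
lam = 2.0

j_value = j_quantized(cum_losses, masses, lam)
bound = two_part_bound(cum_losses, masses, lam)
```

Because the sum over the grid is at most the grid size times its largest term, the gap `bound - j_value` is at most $(1/\lambda) \ln N$ for a grid of $N$ points, so replacing the mixture by its best single term costs little.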



This argument can be easily extended to a batch-approximation
of ESC relative to a union set $\mathcal{H} = \bigcup_k \mathcal{H}_k$
with respect to $k$. That is, ESC of the form of (2) can be
approximated by

$$\min_{k} \min_{\theta \in \Theta_k^{(m)}} \left\{ \sum_{t=1}^{m} L(D_t : f_\theta) + \frac{1}{\lambda} L_m(\theta, k) \right\} \tag{31}$$

where $L_m(\cdot, \cdot)$ is a function $\bigcup_k \Theta_k^{(m)} \times \{1, 2, \ldots\} \to \mathbf{R}^+$
satisfying

$$\sum_{k} \sum_{\theta \in \Theta_k^{(m)}} e^{-L_m(\theta, k)} \le 1.$$

From the above discussion we see that the best batch-approximation
of ESC can be realized by a single hypothesis
that minimizes the weighted sum of the empirical loss for
the hypothesis with respect to $D^m$ and the codelength for the
hypothesis. This fact motivates a batch-learning algorithm that
produces from a data sequence $D^m$ a hypothesis that attains
the minimum of (31). For a loss function $L$, we name this
algorithm the minimum L-complexity algorithm, which we
denote by MLC.

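In order to define MLC, we have to fix a method of
quantization for $\Theta_k$. A question arises as to how finely we should
quantize a continuous parameter space to approximate ESC
best. The optimal quantization scale can be obtained similarly
to the argument by Rissanen [19, pp. 55-56] as follows: Let
$\delta = (\delta_1, \ldots, \delta_k)$ be the maximal quantization scale around
the truncated value $\tau_m(\hat{\theta})$ of $\hat{\theta}$ where

$$\hat{\theta} = \arg\min_{\theta \in \Theta_k} \sum_{t=1}^{m} L(D_t : f_\theta)$$

with

$$\left. \partial \sum_{t=1}^{m} L(D_t : f_\theta) / \partial\theta \right|_{\theta = \hat{\theta}} = 0.$$

Applying Taylor's expansion of $L(D_t : f_{\hat{\theta}+\delta})$ around $\hat{\theta}$ up to
the second order, we have

$$\sum_{t=1}^{m} L(D_t : f_{\hat{\theta}+\delta}) = \sum_{t=1}^{m} L(D_t : f_{\hat{\theta}}) + \frac{m}{2} \delta^T \Sigma \delta + m\, o(\delta^2)$$

where

$$\Sigma = \left( \left. \frac{1}{m} \frac{\partial^2 \sum_{t=1}^{m} L(D_t : f_\theta)}{\partial\theta_i\, \partial\theta_j} \right|_{\theta = \hat{\theta}} \right).$$

Since the codelength for $\hat{\theta}$ is given by $-\sum_i \ln \delta_i + O(1)$, the
minimization of a type of (30) requires that

$$\frac{m}{2} \delta^T \Sigma \delta - \frac{1}{\lambda} \sum_{i} \ln \delta_i$$

be minimized with respect to $\delta$. Supposing that $\Sigma$ is positive
definite and that its maximum eigenvalue is uniformly upper-bounded
by a constant, it can be verified that the minimum is
attained by $\delta$ such that

$$\prod_{i=1}^{k} \delta_i = \Theta\left( \frac{1}{(\lambda m)^{k/2} |\Sigma|^{1/2}} \right).$$

This quantization scale $\delta$ also ensures that the minimum loss
over $\Theta_k^{(m)}$ is within $O(k/\lambda)$ of that over $\Theta_k$. This nature for
the fineness of quantization may be formalized as follows: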

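Condition 17: For given $\lambda > 0$, there exists a quantization
of $\mathcal{H}_k$ such that for some $0 < B < \infty$, for all $m$,
for all $D^m = D_1 \cdots D_m \in \mathcal{D}^*$, the following inequality
holds:

$$\min_{\theta \in \Theta_k^{(m)}} \sum_{t=1}^{m} L(D_t : f_\theta) \le \inf_{\theta \in \Theta_k} \sum_{t=1}^{m} L(D_t : f_\theta) + \frac{Bk}{\lambda} \tag{32}$$

where $\Theta_k^{(m)}$ is a quantization of $\Theta_k$ for sample size $m$.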
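
The quantization-scale argument that motivates Condition 17 can be verified in closed form when $\Sigma$ is diagonal. The Python sketch below is an illustration under that extra assumption (a diagonal $\Sigma$ is not required by the text), and the function names are hypothetical. It computes the minimizer of $\frac{m}{2}\delta^T \Sigma \delta - \frac{1}{\lambda}\sum_i \ln \delta_i$ and evaluates the objective.

```python
import math


def optimal_scale(sigma_diag, m, lam):
    """Coordinate-wise minimizer: delta_i = 1 / sqrt(lam * m * sigma_i)."""
    return [1.0 / math.sqrt(lam * m * s) for s in sigma_diag]


def objective(delta, sigma_diag, m, lam):
    """(m/2) delta^T Sigma delta - (1/lam) sum_i ln delta_i, Sigma diagonal."""
    quad = 0.5 * m * sum(s * d * d for s, d in zip(sigma_diag, delta))
    return quad - sum(math.log(d) for d in delta) / lam


def loss_increase(delta, sigma_diag, m):
    """Second-order loss increase (m/2) delta^T Sigma delta."""
    return 0.5 * m * sum(s * d * d for s, d in zip(sigma_diag, delta))
```

At the minimizer the product of the $\delta_i$ equals $1/((\lambda m)^{k/2} |\Sigma|^{1/2})$ and the loss increase equals $k/(2\lambda)$, which is the $O(k/\lambda)$ gap that (32) formalizes with the constant $B$.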
We are now ready to give a formal definition of MLC. 

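Algorithm MLC: Let $\mathcal{H} = \bigcup_k \mathcal{H}_k$ and $L$ be given. For
each $k$, for each $m$, fix a quantization $\mathcal{H}_k^{(m)}$ of $\mathcal{H}_k$
and let $\mathcal{H}^{(m)} = \bigcup_k \mathcal{H}_k^{(m)}$. For each $m$, fix $L_m : \mathcal{H}^{(m)} \to \mathbf{R}^+$
satisfying

$$\sum_{f \in \mathcal{H}^{(m)}} e^{-L_m(f)} \le 1 \tag{33}$$

and $\lambda$, which may depend on $m$.

Input: $D^m \in \mathcal{D}^*$

Output: $\hat{f} \in \mathcal{H}^{(m)}$ such that

$$\hat{f} = \arg\min_{f \in \mathcal{H}^{(m)}} \left\{ \lambda \sum_{t=1}^{m} L(D_t : f) + L_m(f) \right\}. \tag{34}$$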
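
A small Python sketch may help make (33) and (34) concrete. Everything specific in it is an assumption made for illustration and does not come from the original: $\mathcal{H}_k$ is taken to be piecewise-constant functions on $k$ equal cells of $[0, 1]$, each parameter is quantized to `q` grid points with `q` set to the integer square root of $m$, the loss is quadratic, and the codelength is $L_m(f) = k \ln q + \ln(k(k+1))$ nats, which satisfies (33) because $\sum_k 1/(k(k+1)) = 1$. The function name `mlc_histogram` is hypothetical.

```python
import math


def mlc_histogram(xs, ys, lam, max_k=8):
    """Return (k, thetas) minimizing lam * sum_t (y_t - f(x_t))^2 + L_m(f)."""
    m = len(xs)
    q = max(2, math.isqrt(m))  # grid points per parameter (assumption)
    grid = [j / (q - 1) for j in range(q)]
    best = None
    for k in range(1, max_k + 1):
        # Codelength in nats: k * ln q for the parameters, ln(k(k+1)) for k.
        codelen = k * math.log(q) + math.log(k * (k + 1))
        thetas, loss = [], 0.0
        for b in range(k):
            cell = [y for x, y in zip(xs, ys) if min(int(x * k), k - 1) == b]
            # The quadratic loss separates over cells, so each cell's grid
            # value can be chosen independently of the others.
            th = min(grid, key=lambda g: sum((y - g) ** 2 for y in cell))
            thetas.append(th)
            loss += sum((y - th) ** 2 for y in cell)
        score = lam * loss + codelen
        if best is None or score < best[0]:
            best = (score, k, thetas)
    return best[1], best[2]
```

On data generated by a step function with one jump at $x = 0.5$, the criterion selects the two-cell model: one cell cannot fit the jump, and more cells only add codelength.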



In the case where the hypothesis class is a class of probability
densities and the distortion measure is the logarithmic
loss function, MLC coincides with the statistical model selection
criterion called the minimum description length (MDL)
principle (see [14], [16]-[19, pp. 79-92]), letting $\lambda = 1$.

MLC is closely related to Barron's complexity regularization
algorithm [2], which we denote by CR. Barron showed
that CR takes the same form as (34) with respect to the
quadratic loss function and the logarithmic loss function. For
other bounded loss functions, however, Barron took CR to
have the following different form of

$$\min_{f \in \mathcal{H}^{(m)}} \left\{ \sum_{t=1}^{m} L(D_t : f) + \lambda \left( m L_m(f) \right)^{1/2} \right\} \tag{35}$$

where $\lambda$ is a positive constant. This form was taken to ensure
bounds on the statistical risk of $\hat{f}$ by the method of analysis
in [2], and no longer has interpretation as an approximation of
ESC. Unlike CR of the form of (35), MLC offers a unifying
strategy that always takes the form of (34) regardless of a loss
function and $\lambda$. Our analysis in the next section shows that
MLC has satisfactory bounds on the statistical risk for general
bounded loss functions.

1436 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 44, NO. 4, JULY 1998

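C. Analysis of MLC

We analyze MLC by giving upper bounds on its statistical
risk.

Theorem 3. Upper Bounds on Statistical Risk for MLC:
Suppose that for some $0 < C < \infty$, for all $D$, for all
$f \in \mathcal{H}$, $0 \le L(D : f) \le C$. Let $h(\lambda) \stackrel{\mathrm{def}}{=} (e^{\lambda C} - 1)/C$.
Assume that for the sequence $D^m = D_1, \ldots, D_m \in \mathcal{D}^*$,
each $D_t$ is independently drawn according to the unknown
target distribution $P$. Let $\mathcal{H}^{(m)}$ be a quantization of $\mathcal{H}$. Then
for any $\lambda > 0$, the statistical risk for MLC using $\mathcal{H} = \bigcup_k \mathcal{H}_k$
is upper-bounded as follows:

$$E[\Delta_P(\hat{f})] \le \inf_{\bar{f} \in \mathcal{H}^{(m)}} \left\{ C^2 h(\lambda) + \Delta_P(\bar{f}) + \frac{L_m(\bar{f}) + 1}{m h(\lambda)} \right\}. \tag{36}$$

Specifically, if for given $\lambda > 0$, for each $k$, the quantization
of $\mathcal{H}_k$ satisfies Condition 17 and $L_m(f)$ takes a constant
value over $\mathcal{H}_k^{(m)}$, then the statistical risk for MLC using
$\mathcal{H} = \bigcup_k \mathcal{H}_k$ is upper-bounded as follows:

$$E[\Delta_P(\hat{f})] \le \inf_{f \in \mathcal{H}} \left\{ C^2 h(\lambda) + \Delta_P(f) + \frac{L_m(\bar{f}) + Bk + 1}{m h(\lambda)} \right\} \tag{37}$$

where $\bar{f} \in \mathcal{H}^{(m)}$ is a truncation of $f \in \mathcal{H}$, $k$ is the number of
real-valued parameters in $f$, and $B$ is a constant as in (32).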

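Proof: We abbreviate $\sum_{t=1}^{m} L(D_t : f)$ as $L(D^m : f)$.
Choose $\bar{f} \in \mathcal{H}^{(m)}$ arbitrarily. Let $\hat{f}$ be an output of MLC.

First note that if $\Delta_P(\hat{f}) > \varepsilon$, then the hypothesis that attains
the minimum of the quantity $\lambda L(D^m : f) + L_m(f)$ over
$\mathcal{H}^{(m)}$ lies in the range $\{f \in \mathcal{H}^{(m)} : \Delta_P(f) > \varepsilon\}$. Thus
$\mathrm{Prob}[\Delta_P(\hat{f}) > \varepsilon]$ is upper-bounded as follows:

$$\mathrm{Prob}[\Delta_P(\hat{f}) > \varepsilon] \le \mathrm{Prob}\left[ \min_{f \in \mathcal{H}^{(m)} : \Delta_P(f) > \varepsilon} \{\lambda L(D^m : f) + L_m(f)\} \le \lambda L(D^m : \bar{f}) + L_m(\bar{f}) \right]$$

$$= \mathrm{Prob}\left[ \max_{f \in \mathcal{H}^{(m)} : \Delta_P(f) > \varepsilon} e^{-\lambda L(D^m : f) - L_m(f)} \ge e^{-\lambda L(D^m : \bar{f}) - L_m(\bar{f})} \right]$$

$$\le \sum_{f \in \mathcal{H}^{(m)} : \Delta_P(f) > \varepsilon} \mathrm{Prob}\left[ e^{-\lambda L(D^m : f) - L_m(f)} \ge e^{-\lambda L(D^m : \bar{f}) - L_m(\bar{f})} \right]. \tag{38}$$

Next we evaluate the probability (38). Let $\mathcal{E}$ be the set of
$D^m$ satisfying the event that

$$e^{-\lambda L(D^m : f) - L_m(f)} \ge e^{-\lambda L(D^m : \bar{f}) - L_m(\bar{f})}.$$

For all $f \in \mathcal{H}^{(m)}$, we have

$$\mathrm{Prob}\left[ e^{-\lambda L(D^m : f) - L_m(f)} \ge e^{-\lambda L(D^m : \bar{f}) - L_m(\bar{f})} \right] = \int_{D^m \in \mathcal{E}} dP(D^m)$$

$$\le \int_{D^m \in \mathcal{E}} dP(D^m)\, \frac{e^{-\lambda L(D^m : f) - L_m(f)}}{e^{-\lambda L(D^m : \bar{f}) - L_m(\bar{f})}}$$

$$\le e^{-L_m(f) + L_m(\bar{f})} \int dP(D^m)\, e^{-\lambda L(D^m : f) + \lambda L(D^m : \bar{f})}$$

$$= e^{-L_m(f) + L_m(\bar{f})} \left( \int dP(D)\, e^{-\lambda (L(D : f) - L(D : \bar{f}))} \right)^m. \tag{39}$$

Here we have used the independence assumption for $D$ to
derive the last equation.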

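We use the following key lemma to further evaluate (39).

Lemma 2: For $f$ satisfying $\Delta_P(f) > \varepsilon$,

$$\int dP(D)\, e^{-\lambda (L(D : f) - L(D : \bar{f}))} \le \exp\left[ -h(\lambda) \left( \varepsilon - (C^2 h(\lambda) + \Delta_P(\bar{f})) \right) \right] \tag{40}$$

where $h(\lambda) = (e^{\lambda C} - 1)/C$.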

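By plugging (40) into (39), and then the resulting inequality
into (38), we can upper-bound $\mathrm{Prob}[\Delta_P(\hat{f}) > \varepsilon]$ as follows:

$$\mathrm{Prob}[\Delta_P(\hat{f}) > \varepsilon] \le \sum_{f \in \mathcal{H}^{(m)} : \Delta_P(f) > \varepsilon} e^{-L_m(f)}\, e^{L_m(\bar{f})} \exp\left[ -m h(\lambda) \left( \varepsilon - (C^2 h(\lambda) + \Delta_P(\bar{f})) \right) \right]$$

$$\le \exp\left[ -m h(\lambda) \left( \varepsilon - \left( C^2 h(\lambda) + \Delta_P(\bar{f}) + \frac{L_m(\bar{f})}{m h(\lambda)} \right) \right) \right] \tag{41}$$

where the last inequality follows from the fact

$$\sum_{f \in \mathcal{H}^{(m)} : \Delta_P(f) > \varepsilon} e^{-L_m(f)} \le \sum_{f \in \mathcal{H}^{(m)}} e^{-L_m(f)} \le 1$$

by (33). Letting

$$\varepsilon' = \varepsilon - \left( C^2 h(\lambda) + \Delta_P(\bar{f}) + \frac{L_m(\bar{f})}{m h(\lambda)} \right)$$

(41) is written as

$$\mathrm{Prob}\left[ \Delta_P(\hat{f}) - \left( C^2 h(\lambda) + \Delta_P(\bar{f}) + \frac{L_m(\bar{f})}{m h(\lambda)} \right) > \varepsilon' \right] \le e^{-m h(\lambda) \varepsilon'}.$$

Hence the statistical risk for MLC is upper-bounded as follows:

$$E[\Delta_P(\hat{f})] \le C^2 h(\lambda) + \Delta_P(\bar{f}) + \frac{L_m(\bar{f})}{m h(\lambda)} + \int_0^\infty e^{-m h(\lambda) \varepsilon'}\, d\varepsilon' = C^2 h(\lambda) + \Delta_P(\bar{f}) + \frac{L_m(\bar{f}) + 1}{m h(\lambda)}. \tag{42}$$

Since (42) holds for all $\bar{f} \in \mathcal{H}^{(m)}$, we obtain (36) by
minimizing the right-hand side of (42) with respect to $\bar{f}$ over
$\mathcal{H}^{(m)}$. This completes the proof of (36). Bound (37) can also
be proven quite similarly to (36) under Condition 17. □

Proof of Lemma 2: We start with the following two formulas.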



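Sublemma 1: For $0 < C < \infty$, for $\lambda > 0$,

$$e^{-\lambda x} \le 1 - \frac{1 - e^{-\lambda C}}{C}\, x \quad (0 \le x \le C) \tag{43}$$

$$e^{-\lambda x} \le 1 - \frac{e^{\lambda C} - 1}{C}\, x \quad (-C \le x \le 0). \tag{44}$$

Let

$$V(D : f, \bar{f}) \stackrel{\mathrm{def}}{=} L(D : f) - L(D : \bar{f})$$

and

$$\Delta(f \,\|\, \bar{f}) \stackrel{\mathrm{def}}{=} E_P[V(D : f, \bar{f})].$$

Then $-C \le V(D : f, \bar{f}) \le C$. Thus making use of
(43) and (44), we obtain the following upper bound on
$\int dP(D)\, e^{-\lambda (L(D : f) - L(D : \bar{f}))}$:

$$\int dP(D)\, e^{-\lambda (L(D : f) - L(D : \bar{f}))} = \int_{D : V(D : f, \bar{f}) \ge 0} dP(D)\, e^{-\lambda V(D : f, \bar{f})} + \int_{D : V(D : f, \bar{f}) < 0} dP(D)\, e^{-\lambda V(D : f, \bar{f})}$$

$$\le \int_{D : V(D : f, \bar{f}) \ge 0} dP(D) \left( 1 - \frac{1 - e^{-\lambda C}}{C} V(D : f, \bar{f}) \right) + \int_{D : V(D : f, \bar{f}) < 0} dP(D) \left( 1 - \frac{e^{\lambda C} - 1}{C} V(D : f, \bar{f}) \right). \tag{45}$$

Let

$$C_1 \stackrel{\mathrm{def}}{=} (1 - e^{-\lambda C})/C, \qquad C_2 \stackrel{\mathrm{def}}{=} (e^{\lambda C} - 1)/C$$

and

$$K(f, \bar{f}) \stackrel{\mathrm{def}}{=} \int_{D : V(D : f, \bar{f}) \ge 0} dP(D)\, V(D : f, \bar{f}) \ (\le C).$$

Then (45) can be further upper-bounded as follows:

$$\int dP(D)\, e^{-\lambda (L(D : f) - L(D : \bar{f}))} \le 1 - C_1 \int_{D : V(D : f, \bar{f}) \ge 0} dP(D)\, V(D : f, \bar{f}) - C_2 \int_{D : V(D : f, \bar{f}) < 0} dP(D)\, V(D : f, \bar{f})$$

$$= 1 - C_2 \Delta(f \,\|\, \bar{f}) + (C_2 - C_1) K(f, \bar{f})$$

$$\le 1 - C_2 \Delta(f \,\|\, \bar{f}) + (C_2 - C_1) C$$

$$= 1 - C_2 \Delta_P(f) + C_2 \Delta_P(\bar{f}) + (C_2 - C_1) C \tag{46}$$

where (46) follows from the relation $\Delta(f \,\|\, \bar{f}) = \Delta_P(f) - \Delta_P(\bar{f})$.
Further note that

$$(C_2 - C_1) C = (e^{\lambda C} - 1)^2 / e^{\lambda C} \ge 0$$

and

$$(C_2 - C_1) C = C^2 C_2^2 / e^{\lambda C} \le C^2 C_2^2.$$

Thus we have the following inequality for any $f$ such that
$\Delta_P(f) > \varepsilon$:

$$\int dP(D)\, e^{-\lambda (L(D : f) - L(D : \bar{f}))} \le 1 - C_2 \varepsilon + C^2 C_2^2 + C_2 \Delta_P(\bar{f})$$

$$\le \exp\left[ -C_2 \left( \varepsilon - (C^2 C_2 + \Delta_P(\bar{f})) \right) \right] \tag{47}$$

where (47) follows from the fact that for any $A > 0$, $1 - Ax \le e^{-Ax}$.
Rewriting $C_2$ in (47) as $h(\lambda)$ yields (40). This completes
the proof of Lemma 2. □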
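
The two chord inequalities (43) and (44), and the identity for $(C_2 - C_1)C$ used above, are easy to confirm numerically. The Python sketch below is only a sanity check on a grid, not a proof, and its function names are hypothetical.

```python
import math


def chord_bound(x, lam, C):
    """Right-hand side of (43) for x >= 0 and of (44) for x < 0."""
    c1 = (1.0 - math.exp(-lam * C)) / C
    c2 = (math.exp(lam * C) - 1.0) / C
    return 1.0 - (c1 if x >= 0 else c2) * x


def max_violation(lam, C, steps=200):
    """Largest value of exp(-lam x) - chord_bound(x) over a grid on [-C, C]."""
    xs = [-C + 2.0 * C * i / steps for i in range(steps + 1)]
    return max(math.exp(-lam * x) - chord_bound(x, lam, C) for x in xs)


def gap_identity(lam, C):
    """Return ((C2 - C1) * C, (e^{lam C} - 1)^2 / e^{lam C}); they should agree."""
    c1 = (1.0 - math.exp(-lam * C)) / C
    c2 = (math.exp(lam * C) - 1.0) / C
    return (c2 - c1) * C, (math.exp(lam * C) - 1.0) ** 2 / math.exp(lam * C)
```

Both inequalities hold because $e^{-\lambda x}$ is convex and each right-hand side is the chord through the endpoints of the corresponding interval, so `max_violation` should be zero up to rounding.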

Note that bound (37) is general in the sense that it holds
for all $\lambda > 0$, while Barron's CR of the form of (35) has an
upper bound on its statistical risk



inf 



{iK/)^^)'] 



(48) 



under some constraints of $\lambda$. As will be seen in Corollary 1,
however, MLC also leads to the square-root regularization
term (with respect to $m$) after making necessary
adjustments to $\lambda$ to obtain the least upper bound on its
statistical risk. In the end, MLC has the same performance as
CR, while MLC has generality and allowance of the criterion
to take the form of (34) rather than (35). (Note: A. R.
Barron recently informed the author that the square root in the
regularization term in (35) was not needed for the statistical
risk bounds similar to those given in [2], as pointed out to
him by Bartlett and Lee.)

Let $\mathcal{F}$ be a set of all functions from $\mathcal{X}$ to $\mathcal{Z}$, or a set
of conditional probability densities or conditional probability
mass functions over $\mathcal{Y}$ for given $X \in \mathcal{X}$. Assume that for
a given target distribution $P$, there exists a function $f^*$ that
attains the minimum of $E_P[L(D : f)]$ over all $f$ in $\mathcal{F}$. Letting
$\mathcal{H}_k \subset \mathcal{F}$ be a $k$-dimensional parametric class, the parametric
case is the case where $f^*$ is in $\mathcal{H}_k$ for some finite $k$. The
nonparametric case is the case where $f^*$ is not in $\mathcal{H}_k$ for any
$k < \infty$. Below, as a corollary of Theorem 3, we give upper
bounds on the statistical risk for MLC both for the parametric
and nonparametric cases.

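Corollary 1: Suppose that for each $k$, $\mathcal{H}_k$ and $L$ satisfy
Condition 17 and that for the quantization of $\mathcal{H}_k$ satisfying
(32), any quantization scale $\delta = (\delta_1, \ldots, \delta_k)$ satisfies
$\prod_{i=1}^{k} \delta_i = \Theta(1/(\lambda m)^{k/2})$. Suppose also that $L_m(f)$ takes a
constant value over $\mathcal{H}_k^{(m)}$.

Parametric case: Assume that for the target distribution
$P$, for some $k^* < \infty$, $f^*$ is in $\mathcal{H}_{k^*}$ and is written as $f_{\theta^*}$.
Then letting

$$\lambda = (1/C) \ln\left(1 + C((\ln m)/m)^{1/2}\right)$$

we have the following upper bound on the statistical risk for
MLC:

$$E[\Delta_P(\hat{f})] = O\left( \left( \frac{\ln m}{m} \right)^{1/2} \right). \tag{49}$$

Nonparametric case: Assume that for the target distribution
$P$, for some $\alpha > 0$, for each $k$, the optimal hypothesis
$f^*$ can be approximated by a $k$-dimensional subclass $\mathcal{H}_k$ of
$\mathcal{H}$ with error $\inf_{f \in \mathcal{H}_k} \Delta_P(f) = O(1/k^\alpha)$. Then letting

$$\lambda = (1/C) \ln\left(1 + C((\ln m)/m)^{1/2}\right)$$

we have the following upper bound on the statistical risk for
MLC:

$$E[\Delta_P(\hat{f})] = O\left( \left( \frac{\ln m}{m} \right)^{\alpha/(2(\alpha+1))} \right). \tag{50}$$

In the special case where $\alpha$ is known to MLC in advance,
letting $\lambda = (1/C) \ln\left(1 + C((\ln m)/m)^{\alpha/(2\alpha+1)}\right)$, we have
the following upper bound on the statistical risk for MLC:

$$E[\Delta_P(\hat{f})] = O\left( \left( \frac{\ln m}{m} \right)^{\alpha/(2\alpha+1)} \right). \tag{51}$$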

Proof: First consider the parametric case where $f^*$ is
written as $f_{\theta^*} \in \mathcal{H}_{k^*}$ for some $k^*$. For each $k$, let $\mathcal{H}_k^{(m)}$
be a quantization of $\mathcal{H}_k$ with quantization scale $\delta$ such that
$\prod_{i=1}^{k} \delta_i = \Theta(1/(\lambda m)^{k/2})$ for sample size $m$. Let $\bar{f}$ be the
truncation of $f^* \in \mathcal{H}_{k^*}$. Then we see that

$$L_m(\bar{f}) = O(k^* \ln(m\lambda)) = O(k^* \ln m).$$

If we set $\lambda = (1/C) \ln\left(1 + C((\ln m)/m)^{1/2}\right)$, then we
have $h(\lambda) = O(((\ln m)/m)^{1/2})$. Under Condition 17 we can
use (37) and the fact that $\Delta_P(f^*) = 0$ to upper-bound the
statistical risk as follows:

$$E[\Delta_P(\hat{f})] \le C^2 h(\lambda) + \Delta_P(f^*) + \frac{L_m(\bar{f}) + Bk^* + 1}{m h(\lambda)}$$

which yields (49).

Next consider the nonparametric case where

$$\inf_{f \in \mathcal{H}_k} \Delta_P(f) = O(1/k^\alpha).$$

We can use (37) to obtain the following bound on the statistical
risk for MLC:

$$E[\Delta_P(\hat{f})] \le \min_{k} \inf_{\theta \in \Theta_k} \left\{ C^2 h(\lambda) + \Delta_P(f_\theta) + \frac{L_m(\bar{f}_\theta) + Bk + 1}{m h(\lambda)} \right\} \tag{52}$$

which yields (50). The minimum in (52) is attained by
$k = O((m/\ln m)^{1/(2(\alpha+1))})$. Bound (51) can be obtained similarly
to (50). □
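
The choice of $\lambda$ in Corollary 1 can be made concrete with a short Python sketch. It is an illustration only: the function names are hypothetical, and the numerical values of `codelen`, `B`, and `k` used when calling `risk_bound` are assumptions rather than quantities taken from the paper.

```python
import math


def h(lam, C):
    """h(lambda) = (exp(lambda * C) - 1) / C, as in Theorem 3."""
    return (math.exp(lam * C) - 1.0) / C


def tuned_lambda(m, C, exponent=0.5):
    """lambda = (1/C) ln(1 + C ((ln m)/m)^exponent), so h(lambda) = ((ln m)/m)^exponent."""
    return math.log(1.0 + C * (math.log(m) / m) ** exponent) / C


def risk_bound(m, C, lam, delta_bar, codelen, B=0.0, k=0):
    """Bracketed quantity of (37): C^2 h + Delta_P(f) + (L_m + B k + 1) / (m h)."""
    hl = h(lam, C)
    return C ** 2 * hl + delta_bar + (codelen + B * k + 1.0) / (m * hl)
```

With `exponent=0.5` and a codelength that grows like $k^* \ln m$, both the $C^2 h(\lambda)$ term and the codelength term scale as $((\ln m)/m)^{1/2}$ up to constants, which is the balance behind (49).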

For the parametric case, the convergence rate bound (49)
coincides with that obtained for Barron's CR with respect to $m$
(see [2]). This bound is fastest to date. For the nonparametric
case, (51) is slightly better than (50) and coincides with that
obtained for CR in [2], which is fastest to date. Note that (51)
is attained by CR even if $\alpha$ is not known to CR in advance.

For the quadratic loss function and logarithmic loss function,
MLC takes exactly the same form as CR. Then (49) and
(50) can be improved to $O((\ln m)/m)$ for the parametric
case and $O(((\ln m)/m)^{\alpha/(\alpha+1)})$ for the nonparametric case,
respectively, where $\alpha$ is an index as in Corollary 1 (see the
analysis in [2]).



V. Concluding Remarks 

We have introduced ESC as an extension of stochastic 
complexity to the decision-theoretic setting where a general 
real- valued function is used as a hypothesis and a general loss 
function is used as a distortion measure. Through ESC we 
have given a unifying view of the design and analysis of the 
learning algorithms which have turned out to be most effective 
in batch-learning and on-line prediction scenarios. 

For the on-line prediction scenario, a sequential realization 
of ESC induces the multiplicatively weight-decaying algorithm 
called the aggregating strategy AGG. This corresponds to the 
fact that a sequential realization of SC induces the Bayes 
algorithm for the specific case where the hypothesis class 
is a class of probability densities and the distortion measure 
is the logarithmic loss. We have derived upper bounds on 
both the expected and worst case cumulative losses for AGG, 
for the case where the hypothesis class is specified by a 
continuous parameter. These bounds are the first ones derived 
in a general decision-theoretic setting. Haussler et al. [11] 
derived a worst case lower bound on the cumulative loss 
for any on-line prediction algorithm using a finite hypothesis 
class. They showed that a version of ESC defined relative
to a finite hypothesis class matches their lower bound within error
$o(1)$. However, for the case where the hypothesis class is
continuous, it is an open problem to derive a tight lower
bound on the cumulative loss. Recently this problem has been 
addressed in [29] and it has turned out that under certain 
conditions for a hypothesis class, there exists a lower bound 
on the worst case cumulative loss for any on-line prediction 
algorithm using the hypothesis class, which matches ESC 
within o(ln m) for sample size m. 

For the batch-learning scenario, a batch-approximation of 
ESC using a single hypothesis induces the learning algorithm 
MLC, which is a formal extension of the MDL learning 
algorithm. We have derived upper bounds on the statistical 
risk for MLC with respect to general bounded loss functions. 
It has turned out that MLC has the least upper bounds (to
date) on the statistical risk by tuning $\lambda$ optimally. It remains
for future study to derive a tight lower bound on the statistical
risk for MLC to compare it with our upper bounds.

Throughout this paper it has turned out that $\lambda$ in the
definition of ESC plays an important role both for on-line
prediction and batch learning. In the on-line learning scenario,
$\lambda$ is determined so that AGG never fails, while in the
batch-learning scenario, it is determined so that the rate of
convergence for MLC is made fastest. Hence, $\lambda$ must be tuned
depending on the learning scenario, although there currently
seems to be a lack of unifying understanding of $\lambda$.

ESC might often be computationally or analytically hard 
to compute when the hypothesis class is specified by a 
complicated constraint for real-valued parameters, e.g., neural 
networks and hierarchically parametric models. There remains 
an important computational issue of how to approximate 
ESC, considering the tradeoff between the accuracy of the 
approximation to ESC and its computational efficiency. 
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